
gh-95555: Add the CATEGORY_UCD opcode and the simple enumerated properties #153023

Open
serhiy-storchaka wants to merge 2 commits into python:main from serhiy-storchaka:re-prop-t1


@serhiy-storchaka serhiy-storchaka commented Jul 4, 2026

Member

First of four stacked PRs that complete \p{...} support with the properties backed by the Unicode Character Database, matched in C through a unicodedata capsule. This PR adds the machinery and the simplest tier: the enumerated properties stored as a single byte in the per-character record — Bidi_Class (bc), East_Asian_Width (ea), Grapheme_Cluster_Break (gcb) and Indic_Conjunct_Break (incb).
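
For orientation, two of the four properties in this tier can already be queried from Python through existing `unicodedata` functions, so the intended meaning of something like `\p{bc=AL}` can be approximated without this PR. The sketch below is only an illustration: `has_bidi_class` and `has_east_asian_width` are hypothetical helpers, not part of this change, and they say nothing about how the C matcher works.

```python
import unicodedata


def has_bidi_class(ch: str, value: str) -> bool:
    # Hypothetical helper: roughly the question \p{bc=<value>} asks of one character.
    return unicodedata.bidirectional(ch) == value


def has_east_asian_width(ch: str, value: str) -> bool:
    # Hypothetical helper: roughly the question \p{ea=<value>} asks of one character.
    return unicodedata.east_asian_width(ch) == value


# LATIN SMALL LETTER A, ARABIC LETTER ALEF, FULLWIDTH LATIN CAPITAL LETTER A
text = "a\u0627\uff21"

print([has_bidi_class(c, "AL") for c in text])
print([has_east_asian_width(c, "F") for c in text])
```

Grapheme_Cluster_Break and Indic_Conjunct_Break have no comparable per-character accessor in this sketch, so they are left out.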

unicodedata exports the _ucd_re_CAPI capsule (following the precedent of the name-lookup capsule used for \N{...}) and _ucd_re_info(), which lists the property selectors and value names. The parser resolves a property name and value to a selector and a value index. The new CATEGORY_UCD opcode packs (negate, property, value) into one operand and is matched in C by sre_category_ucd(), which compares the operand's value index with the index the capsule returns for the character. A negated single value is one charset item, so \P{bc=AL} composes inside a set.
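
A rough illustration of packing (negate, property, value) into one operand follows. The bit layout (1 bit for negation, 7 bits for the property selector, 8 bits for the value index), the names `pack_ucd`, `unpack_ucd` and `category_ucd`, the `BIDI_VALUES` ordering, and the stand-in `lookup` callable are all assumptions made for this example. The actual layout and capsule interface in the PR may differ.

```python
import unicodedata

# Assumed layout, for illustration only: [negate:1][property:7][value:8]
VALUE_BITS = 8
PROP_BITS = 7


def pack_ucd(negate: bool, prop: int, value: int) -> int:
    """Pack the three fields into a single integer operand."""
    if not 0 <= prop < (1 << PROP_BITS):
        raise ValueError("property selector out of range")
    if not 0 <= value < (1 << VALUE_BITS):
        raise ValueError("value index out of range")
    return (int(negate) << (PROP_BITS + VALUE_BITS)) | (prop << VALUE_BITS) | value


def unpack_ucd(operand: int) -> tuple[bool, int, int]:
    """Recover (negate, property, value) from a packed operand."""
    value = operand & ((1 << VALUE_BITS) - 1)
    prop = (operand >> VALUE_BITS) & ((1 << PROP_BITS) - 1)
    negate = bool(operand >> (PROP_BITS + VALUE_BITS))
    return negate, prop, value


def category_ucd(operand: int, ch: str, lookup) -> bool:
    """Match one character: compare the looked-up index, then apply negation."""
    negate, prop, value = unpack_ucd(operand)
    return (lookup(prop, ch) == value) != negate


# Hypothetical stand-in for the capsule: selector 0 means Bidi_Class,
# with a truncated, made-up ordering of value names.
BIDI_VALUES = ["L", "R", "AL"]


def lookup(prop: int, ch: str) -> int:
    if prop != 0:
        raise ValueError("unknown selector in this sketch")
    name = unicodedata.bidirectional(ch)
    return BIDI_VALUES.index(name) if name in BIDI_VALUES else -1


# Roughly \P{bc=AL}: the negation travels inside the operand itself.
not_al = pack_ucd(True, 0, BIDI_VALUES.index("AL"))
print(category_ucd(not_al, "a", lookup))
print(category_ucd(not_al, "\u0627", lookup))
```

Carrying the negate flag in the operand is what keeps a negated single value down to one item, which is why it can sit inside a character set next to other items.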

The follow-up PRs add, in order: the numeric and binary properties (ccc, Bidi_Mirrored, Extended_Pictographic), the computed properties (Block, Decomposition_Type, Numeric_Type) and the remaining General_Category values and groups (Ll, Lo, M/P/S) with POSIX punct.

… properties

Introduce the unicodedata capsule used to match \p{...} properties that need
the Unicode Character Database, starting with the simplest tier: the
enumerated properties stored as a single byte in the per-character record --
Bidi_Class (bc), East_Asian_Width (ea), Grapheme_Cluster_Break (gcb) and
Indic_Conjunct_Break (incb).

unicodedata exports the _ucd_re_CAPI capsule (the \N{...} precedent) and
_ucd_re_info(), which lists the property selectors and value names.  The new
CATEGORY_UCD opcode packs (negate, property, value) and is matched in C by
sre_category_ucd().  A negated single value is one charset item, so \P{bc=AL}
composes inside a set.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

read-the-docs-community Bot commented Jul 4, 2026


The CATEGORY_UCD support adds three process-global tables that are set
once and effectively constant: the cached capsule pointer in
sre_category_ucd(), the capsule struct built by unicodedata_create_re_capi(),
and the enumerated-property table in unicodedata_ucd_re_info().  List them
in the c-analyzer ignore file, as done for the name-lookup capsule.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>